Text analytics
The term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation.[1] The term is roughly synonymous with text mining; indeed, Prof. Ronen Feldman modified a 2000 description of "text mining"[2] in 2004 to describe "text analytics."[3] The latter term is now used more frequently in business settings while "text mining" is used in some of the earliest application areas, dating to the 1980s,[4] notably life-sciences research and government intelligence.
Text analytics involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via application of natural language processing (NLP) and analytical methods.
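As a minimal illustration of the "turn text into data" step, the sketch below builds a word frequency distribution over a small corpus using only the Python standard library; the sample documents and the simple tokenization rule are assumptions made for illustration, not part of any particular product or method described in this article.

```python
import re
from collections import Counter

# Hypothetical sample corpus; any collection of documents would do.
documents = [
    "Text analytics turns unstructured text into structured data.",
    "Text mining and text analytics are roughly synonymous terms.",
]

def tokenize(text):
    # A simple lowercase word tokenizer; real systems use richer lexical analysis.
    return re.findall(r"[a-z']+", text.lower())

# Word frequency distribution across the whole corpus.
frequencies = Counter(token for doc in documents for token in tokenize(doc))

for word, count in frequencies.most_common(5):
    print(word, count)
```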
The term also describes the application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text.[5] These techniques and processes discover and present knowledge – facts, business rules, and relationships – that is otherwise locked in textual form, impenetrable to automated processing.
A typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.
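As a sketch of the predictive-classification case described above, the following example trains a simple bag-of-words classifier over a handful of labeled documents. It assumes the scikit-learn library (not mentioned in this article) is available; the documents, labels, and choice of a naive Bayes model are illustrative assumptions only.

```python
# A minimal document-classification sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training documents with topic labels.
train_docs = [
    "Quarterly revenue grew and the stock price rose.",
    "The merger was approved by shareholders.",
    "The protein binds to the receptor in vitro.",
    "Gene expression was measured across cell lines.",
]
train_labels = ["business", "business", "life-science", "life-science"]

# Vectorize the text (turn it into numeric features) and fit a classifier.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Classify a previously unseen document.
print(model.predict(["The enzyme activity increased after treatment."]))
```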
History
The challenge of exploiting the large proportion of enterprise information that originates in "unstructured" form has been recognized for decades.[6] It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:
"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points."
Yet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in "unstructured" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:[7]
For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.
Hearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.
Text Analysis Processes
Subtasks — components of a larger text-analytics effort — typically include:
- Information Retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content management system, for analysis.
- Named Entity Recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on. Disambiguation — the use of contextual clues — may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star (Glenn or Harrison?), or some other entity.
- Recognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via regular expressions or other pattern matches; a minimal sketch appears after this list.
- Coreference: identification of noun phrases and other terms that refer to the same object.
- Relationship, Fact, and Event Extraction: identification of associations among entities and other information in text.
- Sentiment Analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.[8]
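To make the Recognition of Pattern Identified Entities subtask above concrete, here is a minimal sketch using Python's standard re module; the entity types and patterns are deliberately simplified assumptions, not production-grade rules from any of the tools listed later.

```python
import re

text = "Contact sales at sales@example.com or +1 555-0123. The shipment weighed 4.5 kg."

# Simplified patterns for a few pattern-identified entity types.
patterns = {
    "email":    r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "phone":    r"\+?\d[\d ()-]{6,}\d",
    "quantity": r"\d+(?:\.\d+)?\s*(?:kg|g|lb|km|m)\b",
}

for entity_type, pattern in patterns.items():
    for match in re.findall(pattern, text):
        print(entity_type, "->", match)
```

In practice such rule-based extraction is typically combined with the statistical techniques mentioned above, since regular expressions alone cannot resolve ambiguous or context-dependent entities.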
Applications
The technology is now broadly applied for a wide variety of government, research, and business needs. Applications can be sorted into a number of categories by analysis type or by business function. Using this approach to classifying solutions, application categories include:
- Enterprise Business Intelligence/Data Mining, Competitive Intelligence
- E-Discovery, Records Management
- National Security/Intelligence
- Scientific Discovery, especially Life Sciences
- Sentiment Analysis Tools, Listening Platforms
- Natural Language/Semantic Toolkit or Service
- Publishing
- Search/Information Access
Software
There are many text analytics research, commercial, and open source software options. Some are comprehensive solutions; others handle particular subtasks.
Commercial Software
- AeroText - provides a suite of text mining applications for content analysis; content can be in multiple languages.
- Attensity - hosted, integrated and stand-alone text analytics software that uses natural language processing technology to address collective intelligence in social media and forums; the voice of the customer in surveys and emails; customer relationship management; e-services; research and e-discovery; risk and compliance; and intelligence analysis.
- Clarabridge - provides SaaS, hosted and on-premise text and sentiment analytics that enables companies to collect, listen to, analyze, and act on the Voice of the Customer (VOC) from both external (Twitter, Facebook, Yelp!, product forums, etc.) and internal sources (call center notes, CRM, Enterprise Data Warehouse, BI, surveys, emails, etc.).
- General Sentiment - a technology company that produces comprehensive research products to help marketing, sales, and communications executives evaluate their brand performance in the media.
- IBM LanguageWare - the IBM suite for text analytics (tools and Runtime).
- IBM SPSS - provider of PASW Text Analytics for Surveys and PASW Text Analytics, advanced NLP-based text analysis software (multilingual sentiment, event, and fact extraction) that can be used in conjunction with SPSS Predictive Analysis Solutions.
- Language Computer Corporation – provides a suite of customizable text extraction and analysis tools including natural language search, available in multiple languages.
- Lexalytics - provides commercial sentiment analysis for many OEM and direct customers including analysis of financial news feeds for the Thomson Reuters RMDS trading information system.
- MeshLabs - develops text analytics solutions that extract information from unstructured data and deliver personalized, actionable insights across content sources, channels, and types.
- SAS - a leading business intelligence and business analytics provider, SAS provides text analysis capabilities with the Enterprise Miner data-mining workbench and via Teragram linguistic-analysis tools.
- StatSoft - provides a Text Miner extension to the STATISTICA Data Miner product. STATISTICA Text Miner features text retrieval, pre-processing, and analytic procedures for unstructured text data, with options to convert text into numeric information for mapping, clustering, and predictive data mining.
- Sysomos - provider of a social media analytics software platform, including text analytics and sentiment analysis of online consumer conversations.
Open-Source Software
- GATE - General Architecture for Text Engineering, an open-source toolbox for natural language processing
- Text Engineering Software Laboratory (Tesla) - A component framework for experiments in natural language processing
- Apache UIMA - Unstructured Information Management Architecture
- Natural Language Toolkit - open-source Python modules, linguistic data, and documentation for text analytics; a brief usage sketch follows this list
- RapidMiner - open-source software for data and text mining
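As a brief usage sketch for the Natural Language Toolkit entry above: the snippet tokenizes a sentence, tags parts of speech, and builds a word frequency distribution. It assumes NLTK is installed and that the required tokenizer and tagger resources have already been downloaded.

```python
# A minimal NLTK sketch; assumes the required corpora/models have been downloaded,
# e.g. nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

text = "Text analytics turns unstructured text into data for analysis."

tokens = nltk.word_tokenize(text)                  # lexical analysis: split into word tokens
tagged = nltk.pos_tag(tokens)                      # tag each token with a part of speech
freqs = nltk.FreqDist(t.lower() for t in tokens)   # word frequency distribution

print(tagged[:5])
print(freqs.most_common(3))
```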